AITopics | data reduction

data reduction

A Consolidated Cross-Validation Algorithm for Support Vector Machines via Data Reduction

Neural Information Processing Systems | Dec-23-2025, 16:42:39 GMT

We propose a consolidated cross-validation (CV) algorithm for training and tuning the support vector machines (SVM) on reproducing kernel Hilbert spaces. Our consolidated CV algorithm utilizes a recently proposed exact leave-one-out formula for the SVM and accelerates the SVM computation via a data reduction strategy. In addition, to compute the SVM with the bias term (intercept), which is not handled by the existing data reduction methods, we propose a novel two-stage consolidated CV algorithm. With numerical studies, we demonstrate that our algorithm is about an order of magnitude faster than the two mainstream SVM solvers, kernlab and LIBSVM, with almost the same accuracy.

consolidated cross-validation algorithm, name change, support vector machine, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.71)

Edge-Based Predictive Data Reduction for Smart Agriculture: A Lightweight Approach to Efficient IoT Communication

Krekovic, Dora, Kusek, Mario, Zarko, Ivana Podnar, Le-Phuoc, Danh

arXiv.org Artificial Intelligence | Nov-25-2025

The rapid growth of IoT devices has led to an enormous amount of sensor data that requires transmission to cloud servers for processing, resulting in excessive network congestion, increased latency and high energy consumption. This is particularly problematic in resource-constrained and remote environments where bandwidth is limited, and battery-dependent devices further emphasize the problem. Moreover, in domains such as agriculture, consecutive sensor readings often have minimal variation, making continuous data transmission inefficient and unnecessarily resource intensive. To overcome these challenges, we propose an analytical prediction algorithm designed for edge computing environments and validated through simulation. The proposed solution utilizes a predictive filter at the network edge that forecasts the next sensor data point and triggers data transmission only when the deviation from the predicted value exceeds a predefined tolerance. A complementary cloud-based model ensures data integrity and overall system consistency. This dual-model strategy effectively reduces communication overhead and demonstrates potential for improving energy efficiency by minimizing redundant transmissions. In addition to reducing communication load, our approach leverages both in situ and satellite observations from the same locations to enhance model robustness. It also supports cross-site generalization, enabling models trained in one region to be effectively deployed elsewhere without retraining. This makes our solution highly scalable, energy-aware, and well-suited for optimizing sensor data transmission in remote and bandwidth-constrained IoT environments.
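
The transmit-on-deviation idea described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the predictor simply repeats the last transmitted value, and the name `filter_readings`, the sample temperatures, and the tolerance are hypothetical.

```python
def filter_readings(readings, tolerance):
    """Return the (index, value) pairs an edge node would transmit.

    Assumption: the predictor repeats the last transmitted value.
    The predictor used in the paper is not reproduced here.
    """
    transmitted = []
    predicted = None  # nothing is known before the first reading
    for i, value in enumerate(readings):
        # Send only when the reading deviates from the prediction
        # by more than the tolerance (or when no prediction exists yet).
        if predicted is None or abs(value - predicted) > tolerance:
            transmitted.append((i, value))
            predicted = value  # edge and cloud would both update here
    return transmitted


temps = [20.0, 20.1, 20.2, 21.0, 21.1, 19.5]  # hypothetical sensor data
sent = filter_readings(temps, tolerance=0.5)
print(sent)
```

In this toy run only three of the six readings are sent; a cloud-side copy of the same predictor could fill in the skipped readings with the last received value.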

artificial intelligence, cloud computing, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.19103

Country: Europe > Croatia (0.15)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology (1.00)
Food & Agriculture > Agriculture (1.00)
Energy (1.00)

Technology:

Information Technology > Internet of Things (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications > Networks (1.00)
(2 more...)

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Zhang, Hengrui, Hui, Yulong, Liu, Yihao, Zhang, Huanchen

arXiv.org Artificial Intelligence | Sep-17-2025

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Given large document collections and ad-hoc queries, Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
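
A two-threshold cascade of the kind this abstract describes might look like the sketch below. It is an illustration only: the thresholds are fixed by hand, whereas the abstract describes an adaptive mechanism that is not reproduced here, and `cascade_filter`, `fake_llm`, and the scores are hypothetical stand-ins.

```python
def cascade_filter(scores, low, high, llm_predicate):
    """Decide each document from its proxy score, asking the LLM only
    for scores that fall strictly between the two thresholds.

    scores: dict mapping a document id to a proxy score in [0, 1].
    Returns (decisions, number_of_llm_calls).
    """
    decisions = {}
    llm_calls = 0
    for doc_id, score in scores.items():
        if score >= high:
            decisions[doc_id] = True       # confidently accepted by proxy
        elif score <= low:
            decisions[doc_id] = False      # confidently rejected by proxy
        else:
            decisions[doc_id] = llm_predicate(doc_id)  # ambiguous case
            llm_calls += 1
    return decisions, llm_calls


def fake_llm(doc_id):
    # Hypothetical stand-in for an expensive LLM call.
    return doc_id.endswith("b")


proxy_scores = {"doc_a": 0.95, "doc_b": 0.50, "doc_c": 0.05, "doc_d": 0.40}
decisions, llm_calls = cascade_filter(proxy_scores, low=0.2, high=0.8,
                                      llm_predicate=fake_llm)
print(decisions, llm_calls)
```

Widening the gap between `low` and `high` sends more documents to the LLM, trading cost for accuracy.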

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2509.1261

Country:

Asia (0.67)
North America > Mexico (0.28)
Europe > Italy (0.28)

Genre:

Overview (1.00)
Research Report (0.82)
Workflow (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems

Leysen, Jens, Favier, Marco, Goethals, Bart

arXiv.org Artificial Intelligence | Sep-1-2025

Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data "necessity" difficult to implement.

artificial intelligence, machine learning, minimization, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3746252.3761058

2508.21547

Country:

South America > Brazil (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.05)
(21 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Statutes (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Supplemental Materials: A Consolidated Cross-Validation Algorithm for Support Vector Machines via Data Reduction

Neural Information Processing Systems | Aug-13-2025, 16:22:31 GMT

Alternatively, one can use random features (Rahimi and Recht, 2007) to approximate the kernel matrix.
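
As a rough illustration of the random-features idea in this excerpt, the sketch below approximates a Gaussian (RBF) kernel value with random Fourier features using only the Python standard library. The kernel parameterization exp(-gamma * ||x - y||^2), the helper names, and the sample points are assumptions for the example and are not taken from the supplemental materials.

```python
import math
import random


def make_rff(dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map for k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def rbf_kernel(x, y, gamma):
    """Exact kernel value, used here only for comparison."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


gamma = 0.5
phi = make_rff(dim=2, n_features=5000, gamma=gamma)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product of features
exact = rbf_kernel(x, y, gamma)
print(approx, exact)
```

The inner product of the feature vectors estimates the kernel value, so an n-by-n kernel matrix can be replaced by an n-by-D feature matrix, and the approximation error shrinks as D grows.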

artificial intelligence, inequality, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)

A Consolidated Cross-Validation Algorithm for Support Vector Machines via Data Reduction

Neural Information Processing Systems | Aug-13-2025, 16:22:28 GMT

The success of the SVM is mainly attributed to its appealing geometric interpretation, solid theoretical foundation, and high predictive power.

algorithm, artificial intelligence, machine learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Iowa > Johnson County > Iowa City (0.14)
North America > Canada > Quebec > Montreal (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)

Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Chen, Fei, Zhou, Wenchi

arXiv.org Artificial Intelligence | Aug-11-2025

In order to increase the effectiveness of model training, data reduction is essential to data-centric Artificial Intelligence (AI). It achieves this by locating the most instructive examples in massive datasets. To increase data quality and training efficiency, the main difficulty is choosing the best examples rather than the complete datasets. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our findings imply that training a classifier on the chosen optimal subset may improve model performance and increase training efficiency when combined with an efficient data reduction strategy. Furthermore, we have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models, yielding insightful results for faster training and cross-lingual data reduction.
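
The static filtering step in this abstract can be illustrated with a small sketch. It assumes the usual definition of pointwise V-information, PVI = log2 p(y | x) - log2 p(y | null input), where a higher PVI marks an easier instance. The probabilities, the 20% removal fraction, and the helper names are hypothetical.

```python
import math


def pvi(p_with_input, p_null):
    """Pointwise V-information in bits, given the probability of the gold
    label from a model that sees the input and from one that sees a null input."""
    return math.log2(p_with_input) - math.log2(p_null)


def drop_easiest(scores, fraction):
    """Return the indices kept after removing the highest-PVI (easiest) fraction."""
    n_drop = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(scores)) if i not in dropped]


# Hypothetical (p_with_input, p_null) pairs for five training instances.
probs = [(0.5, 0.5), (1.0, 0.5), (0.25, 0.5), (1.0, 0.25), (0.5, 0.25)]
scores = [pvi(p, q) for p, q in probs]
kept = drop_easiest(scores, fraction=0.2)
print(scores, kept)
```

Instance 3 has the highest PVI (2 bits), so it is the one removed when 20% of the five instances are dropped.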

data quality, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.00038

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre:

Instructional Material (0.93)
Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Scale Efficient Training for Large Datasets

Zhou, Qing, Gao, Junyu, Wang, Qi

arXiv.org Artificial Intelligence | Mar-17-2025

The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
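
The three steps named in this abstract (random pruning, grouping by loss, and a sliding window that moves from easy to hard) can be sketched roughly as below. This is not the SeTa implementation from the linked repository. Equal-size loss-sorted groups stand in for clustering, the window is assumed to advance by one group per epoch, and all names and parameters are hypothetical.

```python
import random


def select_samples(losses, keep_ratio, n_groups, window, epoch, seed=0):
    """Return the sample indices to train on in the given epoch.

    Assumptions: equal-size groups of loss-sorted samples replace
    clustering, and the window shifts one group per epoch, wrapping around.
    """
    rng = random.Random(seed + epoch)
    indices = list(range(len(losses)))
    # Step 1: random pruning to drop a share of (presumed redundant) samples.
    kept = rng.sample(indices, int(len(indices) * keep_ratio))
    # Step 2: order by difficulty (loss) and split into groups, easiest first.
    kept.sort(key=lambda i: losses[i])
    size = -(-len(kept) // n_groups)  # ceiling division
    groups = [kept[g * size:(g + 1) * size] for g in range(n_groups)]
    # Step 3: keep only the groups inside the current window position.
    n_positions = n_groups - window + 1
    start = epoch % n_positions
    return [i for group in groups[start:start + window] for i in group]


losses = [0.1 * i for i in range(20)]  # hypothetical per-sample losses
chosen = select_samples(losses, keep_ratio=0.5, n_groups=5, window=2, epoch=0)
print(chosen)
```

With these settings, 10 of 20 samples survive random pruning and the window covers 2 of 5 groups, so 4 samples are used in the epoch.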

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2503.13385

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

SoK: Knowledge is All You Need: Last Mile Delivery for Automated Provenance-based Intrusion Detection with LLMs

Cheng, Wenrui, Zhu, Tiantian, Xiong, Chunlin, Sun, Haofei, Wang, Zijun, Jing, Shunan, Lv, Mingqi, Chen, Yan

arXiv.org Artificial Intelligence | Mar-4-2025

Recently, provenance-based intrusion detection systems (PIDSes) have been widely proposed for endpoint threat analysis. However, due to the lack of systematic integration and utilization of knowledge, existing PIDSes still require significant manual intervention for practical deployment, making full automation challenging. This paper presents a disruptive innovation by categorizing PIDSes according to the types of knowledge they utilize. In response to the prevalent "knowledge silos" problem in existing research, we introduce a novel knowledge-driven provenance-based intrusion detection framework, powered by large language models (LLMs). We also present OmniSec, a best practice system built upon this framework. By integrating attack representation knowledge, threat intelligence knowledge, and benign behavior knowledge, OmniSec outperforms the state-of-the-art approaches on public benchmark datasets. OmniSec is available online at https://anonymous.4open.science/r/PIDS-with-LLM-613B.

detection, graph, node, (16 more...)

arXiv.org Artificial Intelligence

2503.03108

Country: Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre:

Research Report > Promising Solution (0.34)
Overview > Innovation (0.34)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.46)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The Clear Sky Corridor: Insights Towards Aerosol Formation in Exoplanets Using An AI-based Survey of Exoplanet Atmospheres

Ashtari, Reza, Stevenson, Kevin B., Sing, David, Lopez-Morales, Mercedes, Alam, Munazza K., Nikolov, Nikolay K., Evans-Soma, Thomas M.

arXiv.org Artificial Intelligence | Dec-20-2024

Producing optimized and accurate transmission spectra of exoplanets from telescope data has traditionally been a manual and labor-intensive procedure. Here we present the results of the first attempt to improve and standardize this procedure using artificial intelligence (AI) based processing of light curves and spectroscopic data from transiting exoplanets observed with the Hubble Space Telescope's (HST) Wide Field Camera 3 (WFC3) instrument. We implement an AI-based parameter optimizer that autonomously operates the Eureka pipeline to produce homogeneous transmission spectra of publicly available HST WFC3 datasets, spanning exoplanet types from hot Jupiters to sub-Neptunes. Surveying 42 exoplanets with temperatures between 280 and 2580 Kelvin, we confirm modeled relationships between the amplitude of the water band at 1.4 μm in hot Jupiters and their equilibrium temperatures. We also identify a similar, novel trend in Neptune/sub-Neptune atmospheres, but shifted to cooler temperatures. Excitingly, a planet mass versus equilibrium temperature diagram reveals a "Clear Sky Corridor," where planets between 700 and 1700 Kelvin (depending on the mass) show stronger 1.4 μm H2O band measurements. This novel trend points to metallicity as a potentially important driver of aerosol formation. As we unveil and include these new discoveries into our understanding of aerosol formation, we enter a thrilling future for the study of exoplanet atmospheres. With HST sculpting this foundational understanding for aerosol formation in various exoplanet types, ranging from Jupiters to sub-Neptunes, we present a compelling platform for the James Webb Space Telescope (JWST) to discover similar atmospheric trends for more planets across a broader wavelength range.

artificial intelligence, machine learning, optimization problem, (16 more...)

arXiv.org Artificial Intelligence

2410.06804

Country:

North America > United States > Maryland > Baltimore (0.04)
Oceania > Australia > New South Wales > Callaghan (0.04)
North America > United States > Maryland > Prince George's County > Laurel (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)